The Twofish Encryption Algorithm: A 128-Bit Block Cipher The Twofish Encryption Algorithm: A 128-Bit Block Cipher
by Bruce Schneier ; John Kelsey ; Doug Whiting ; David Wagner ; Chris Hall ; Niels Ferguson
Wiley Computer Publishing, John Wiley & Sons, Inc.
ISBN: 0471353817   Pub Date: 03/01/99
  

Previous Table of Contents Next


Alternate Constructions. It should also be noted that a very different type of construction could be used to preserve MDS properties on rotation. Use of an 8-by-8 MDS matrix over CF(24) will guarantee eight output nibble changes for every input nibble change. Because the changes now are nibble-based, a 1-bit rotation may shift the only non-zero bit of a nibble out of a byte, but the other nibble remains entirely contained in the byte. In fact, it can easily be seen that this construction preserves the MDS property nicely even for multi-bit rotations. Unfortunately, 8-by-8 MDS matrices over GF(24) are nowhere near as plentiful as 4-by-4 matrices over GF(28), so very little freedom is available to pick simple coefficients. The best construction for such a matrix seems to be an extended RS-(15,7) code over GF(24), which requires considerably more gates in hardware and more tables in smart card firmware than the Twofish matrix. Because of this additional cost, we decided not to use an 8-by-8 MDS matrix over GF(24).

Hamming
Weight
Number of
Occurrences
Number with
MSB Set
8 7/1020 1
9 23/1020 4
10 42/1020 15
Table 7.6. Hamming Weight Distribution of MDS Output after 1-byte Input Change

7.3.4 Rotational Uniqueness of Output Vectors

Another constraint imposed on the Twofish MDS matrix is that no row (or column) of the matrix be a rotation of another row (or column) of the matrix. This property guarantees that all single-byte input differences result in unique output differences, even when rotations over eight bits are applied to the output, as is done in generating the round subkeys. This constraint did not seem to limit the pool of candidate matrices significantly. In fact, the Twofish MDS matrix actually exhibits a much stronger property: all 1020 MDS output differences for single-byte input changes are distinct from all others, even under bitwise rotations by each of the rotation values in the range 6..26. The subkey generation routine takes advantage of this property to help thwart related-key differential attacks. For all single-byte input changes, the rotation of the output 32-bit word B by eight bits has an output difference that is guaranteed to be unique from output differences in the unrotated A quantity.

7.3.5 Maximizing the Minimal Hamming Distance

There are many MDS matrices containing only the three elements 01, EF, and 5B. The particular Twofish matrix was also chosen to maximize the minimum binary Hamming weight of the output differences over all single-byte input differences. The Twofish MDS matrix guarantees that any single-byte input change will produce an output Hamming difference of at least eight bits, in addition to the property that all four output bytes will be affected. In fact, as shown in Table 7.6, only seven of the 1020 possible single-byte input differences result in output differences with Hamming weight difference 8; the remaining 1013 differences result in higher Hamming weights. Also, only one of the seven outputs with Hamming weight 8 has its most significant bit set, meaning that at least eight bits will be affected even in the PHT term T0 + 2T1 for 1019 of the 1020 possible single-byte input differences. Input differences in two bytes are guaranteed to affect three output bytes, and they can result in an output Hamming difference of only three bits, with probability 0.000018; the probability that the output difference Hamming weight is less than eight bits for 2-byte input differences is only 0.00125, which is less than the corresponding probability (0.0035) for a totally random binomial distribution.

There are many other MDS matrices, using either the same or another set of simple field elements, that can guarantee the same minimum output Hamming difference, and the particular matrix chosen is representative of the class. No higher minimum Hamming weight was found for any matrices with such simple elements.

7.4 PHT

The PHT operation, including the addition of the round subkeys, was chosen to facilitate very fast operation on the Pentium CPU family using the LEA (load effective address) opcodes. The LEA opcodes allow the addition of one register to a shifted (by 1,2,4,8) version of another register, along with a 32-bit constant, all in a single clock cycle, with the result placed in an arbitrary Pentium register. For best performance, a version of the encryption and decryption code can be “compiled” for a given key, with the round subkeys inserted as constant values in LEA opcodes in the instruction stream. This is what was done for the “compiled” keying option explained in section 5.1.1. This approach requires a full instantiation in memory of the code for each key in use, but it provides a speedup for bulk encryption.

7.4.1 Eliminating the PHT

Instead of using four key-dependent S-boxes, a 4-by-4 MDS matrix, and the PHT, we could have used eight key-dependent S-boxes and an 8-by-8 MDS matrix over GF(28), hence eliminating the PHT. Such a construction would be easier to analyze and would have nicer properties, but it is much slower in virtually all implementations and would not be worth it.

7.4.2 Diffusion and the Least Significant Bit

One disadvantage of the PHT is that it does not provide complete diffusion for the special case of the least significant bit of T0 + 2T1. In particular, the least significant bit of T0 + 2T1 depends only on the least significant bit of T0, and thus depends only on half of the input bits to the Feistel function. This means that F does not exhibit complete diffusion: there is one output bit which is not affected by all input bits. In fact, we use this property in our differential attack in Section 9.2.

We could have replaced the PHT with a variant that exhibits better diffusion; however, this would have incurred a significant performance penalty. Is it better to eliminate this weakness, if the price is that we must reduce the number of rounds to retain the same performance?

We believe that the PHT yields an excellent bang-for-the-buck ratio. In practice, the PHT’s incomplete diffusion to the least significant bit of T0 + 2T1 should not be a problem, because full diffusion will be easily achieved in the next two rounds; also, the 1-bit rotation prevents the least significant bits from lining up across multiple rounds, which seems to prevent attacks based on analyzing least significant bits.

7.5 Key Addition

As noted in the previous section, the round subkeys axe combined with the PHT output via addition to enable optimal performance on the Pentium CPU family. From a cryptographic standpoint, an XOR operation could have been used, but it would reduce the best Pentium software performance for bulk encryption. It should be noted that using addition instead of XOR does impose a minor gate count and speed penalty in hardware, but this additional overhead was considered to be well worth the extra performance in software. On a smart card, using addition instead of XOR has virtually no impact on code size or speed.

7.6 Feistel Combining Operation

Twofish uses XOR to combine the output of F with the target block. This is done primarily for simplicity; XOR is the most efficient operation in both hardware and software. We chose not to use addition (used in MD4 [Riv91], MD5 [Riv92], RIPE-MD [RIPE92], HAVAL [ZPS93], RIPEMD-160 [DBP96], and SHA [NIST93]), or a more complicated combining function like Latin squares (used in DESV [CDN95]). We also did not implement dynamic swapping [KKT94] or other additional complexities.


Previous Table of Contents Next